This project aims to compare the performance of several Machine Learning (ML) models under different data contexts.
To achieve this goal, we pitted pairs of models against each other using synthetic datasets that represent often polarizing and extreme situations, thereby exposing the main “decision” characteristics of each model.
The focus will be solely on classification problems, and models will be compared against each other using Accuracy and Area Under the ROC Curve (ROC AUC) as metrics. Since we have full control over the dataset distribution, the optimal Bayes boundary will be used as the baseline.
This project is organized in the following way:
An introduction containing general assumptions and how to reproduce the results,
A chapter for each model pair containing the synthetic data rules, the Bayes optimal boundary (BOB), a short explanation of each model and the rationale for each dataset, the model fit and metrics, as well as the prediction area and a discussion of the results,
An overall conclusion
Throughout this project the following assumptions were made and approaches were used:
Dependent variables (target) are of data type factor and all independent variables (features) are of type numeric,
Since the datasets are synthetically generated, no pre-processing was made,
Datasets were built using statistical distributions and/or classification rules which may have no resemblance to reality,
To the extent possible, each dataset was created using a binomial categorical variable and two features,
For the sake of model comparison, the default hyperparameter values were used. These are provided by the parsnip R package, without post-processing or hyperparameter optimization. It is possible, although unlikely, that a different package or different hyperparameters could lead to different conclusions,
For calculating metrics, cross-validation with 10 folds and no repetition was used.
R Core Team (2022)
The file _main.rmd compiles all the work. To reproduce the results, the following script should be copied into an R script and sourced from the notebook.
######## Copy of script to run "./scripts/helper.R"
# libraries ---------------------------------------------------------------
library(ggplot2)
library(dplyr)
library(purrr)
library(mvtnorm)
library(tidyr)
library(magrittr)
library(tidymodels)
# Model specific libraries
library(discrim)
library(C50)
library(rpart)
library(LiblineaR)
library(kernlab)
library(keras)
library(nnet)
# Dataset and bayes optimal boundary plot ---------------------------------
#' Generate a uniform synthetic dataset and its Bayes optimal boundary plot
#'
#' @param size numeric: number of observations to generate
#' @param nVar numeric: number of features (x1, x2, ...)
#' @param n_g numeric: number of classes
#' @param class_fun function: classification rule applied to the features
#' @param treshold numeric: prior probability used to weight class 0
#' @return results list: list(dataset_plot = dataset_plot, dataset = dataset, cond = grid, border_plot = dataset_plot_border)
dataset_gen_unif <- function(size = 1000, nVar = 2, n_g = 2, class_fun = NULL, treshold = 0.5)
{
# Verify if inputs are correct data types
stopifnot("A numeric value needs to be provided for the size of the dataset" = is.numeric(size))
stopifnot("A numeric value needs to be provided for the number of variables to be produced" = is.numeric(nVar))
stopifnot("The classification function needs to be of the type function" = is.function(class_fun))
stopifnot("Number of variables needs to be equal or above 2" = nVar >= 2)
# Random sample of data
sample <- replicate(nVar,stats::runif(size, min = 0, max = 10))
sample <- dplyr::as_tibble(sample) %>% magrittr::set_colnames(paste0("x", 1:nVar))
# Applies the classification function to the features
dataset <- sample %>%
mutate(
g = purrr::pmap_dbl(., class_fun ),
g = factor(g)
)
# Creates plot
dataset_plot <- ggplot(dataset, aes(x1, x2, color = factor(g))) +
geom_point(size = 3, shape = 1) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line = element_blank(),
legend.position="bottom",
panel.border = element_rect(colour = "black", fill=NA, size=1),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
plot.caption.position = "plot"
) +
scale_colour_brewer(palette = "Set1")
## Build grid for contour
x1_range <- seq(0, 10, by = 0.05)
x2_range <- seq(0, 10, by = 0.05)
grid <- expand.grid(x1 = x1_range, x2 = x2_range)
# conditional probability of (x1, x2) given y = 0
grid <- grid %>%
mutate(
g = purrr::pmap_dbl(., class_fun ),
g = factor(g)
)
l <- list()
for (i in 1:n_g) {
l[[i]] <- ifelse(grid$g == i, 1, 0)
}
# Calculates conditional probabilities
conditional_prb = do.call(cbind.data.frame, l) %>%
set_colnames(paste0("px_G",0:(n_g-1))) %>%
mutate(
py0_x = treshold * px_G0,
py1_x = (1-treshold) * px_G1,
bayesborder = py1_x - py0_x ,
predictclass = ifelse(py0_x > py1_x, 0, 1) # posterior class
)
grid <- cbind(grid, conditional_prb)
dataset_plot_border <- dataset_plot +
geom_contour(data = grid, aes(x = x1,y = x2, z = bayesborder), color = "black", linetype = "dashed", breaks = 0)
# return results
results <- list(dataset_plot = dataset_plot, dataset = dataset, cond = grid, border_plot = dataset_plot_border)
return(results)
}
# Dataset gen with multivariate normal dist ------------------------------
#' Generate a dataset from a mixture of multivariate normal distributions
#'
#' @param l_mu list of mean vectors, one per class
#' @param l_cvm list of covariance matrices, one per class
#' @param l_w list of mixture weights
#' @param size numeric: total number of observations
#' @param nVar numeric: number of features
#' @param n_g numeric: number of classes
#' @param class_fun function: classification rule for the contour grid
#' @param treshold numeric: prior probability used to weight class 0
#' @return list(dataset_plot, dataset, cond, border_plot)
dataset_gen_mvnorm <- function(l_mu, l_cvm,l_w, size = 1000, nVar = 2, n_g = 2, class_fun = NULL, treshold = 0.5) {
# generates samples
l_sample <- list()
for (i in 1:length(l_mu)) {
s <- cbind(rmvnorm(size/length(l_w), l_mu[[i]], l_cvm[[i]]),i-1)
l_sample[[i]] <- s
}
dataset <- do.call(rbind.data.frame,l_sample) %>%
magrittr::set_colnames( c(paste0("x", 1:nVar),"g" ) ) %>%
mutate(g = factor(g))
# Creates plot
dataset_plot <- ggplot(dataset, aes(x1, x2, color = factor(g))) +
geom_point(size = 3, shape = 1) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line = element_blank(),
legend.position="bottom",
panel.border = element_rect(colour = "black", fill=NA, size=1),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
plot.caption.position = "plot"
) +
scale_colour_brewer(palette = "Set1")
## Build grid for contour
x1_range <- seq(min(dataset$x1), max(dataset$x1), by = 0.05)
x2_range <- seq(min(dataset$x2), max(dataset$x2), by = 0.05)
grid <- expand.grid(x1 = x1_range, x2 = x2_range)
# conditional probability of (x1, x2) given y = 0
grid <- grid %>%
mutate(
g = purrr::pmap_dbl(., class_fun ),
g = factor(g)
)
grid_merge <- merge(grid, data.frame( class = 0:(n_g-1) ), all=TRUE)
l <- list()
for (i in 1:n_g) {
l[[i]] <- ifelse(grid$g == i, 1, 0)
}
new_grid <- grid_merge %>% mutate(p_class = ifelse(class == g, 1, 0))
# Calculates conditional probabilities
conditional_prb = do.call(cbind.data.frame, l) %>%
set_colnames(paste0("px_G",0:(n_g-1))) %>%
mutate(
py0_x = treshold * px_G0,
py1_x = (1-treshold) * px_G1,
bayesborder = py1_x - py0_x ,
predictclass = ifelse(py0_x > py1_x, 0, 1) # posterior class
)
grid <- cbind(grid, conditional_prb)
dataset_plot_border <- dataset_plot +
geom_contour(data = grid, aes(x = x1, y = x2, z = bayesborder), color = "black", linetype = "dashed", breaks = 0)
dataset_plot_border_newgrid <- dataset_plot +
geom_contour(data = new_grid, aes(x = x1, y = x2, z = p_class, color = as.factor(class), group = as.factor(class)), bins = 1)
# return results
results <- list(dataset_plot = dataset_plot, dataset = dataset, cond = new_grid, border_plot = dataset_plot_border_newgrid)
return( results )
}
# Classification metrics function -----------------------------------------
#' Calculate metrics for each model
#'
#' @param test_data A data frame
#' @param model A model
#' @return A list with fit, confusion matrix, confusion plot, accuracy, roc curve
#' and auc roc
#'
#' @examples
#'
model_metrics <- function(test_data = NULL, model = NULL ) {
# Verify if inputs are correct data types
stopifnot("A test dataframe should be provided" = is.data.frame(test_data))
stopifnot("A model should be provided" = !is.null(model))
# fit test data
fit_test <-
test_data %>%
bind_cols(
predict(model, new_data = test_data),
predict(model, new_data = test_data, type = "prob"),
) %>%
mutate_if(is.numeric, round, digits= 3) %>%
mutate(
decision = .pred_1 - .pred_0
)
# confusion matrix
confusion_matrix <- conf_mat(fit_test, truth = g, estimate = .pred_class)
confusion_matrix_plot <- autoplot(confusion_matrix, type = "heatmap")
# Accuracy
acc <- accuracy(fit_test, truth = g, estimate = .pred_class)
# Roc curve
roc_curve <- if (nlevels(test_data$g) > 2) {
roc_curve(
fit_test, truth = g,
paste0(".pred_",0):paste0(".pred_",nlevels(test_data$g)-1),
.level = .pred_0) %>%
autoplot()
} else {
roc_curve(fit_test, truth = g, estimate = .pred_0) %>%
autoplot()
}
auc_roc <- if (nlevels(test_data$g) > 2) {
estimator <- "macro_weighted" # this branch is only reached when nlevels > 2
roc_auc(fit_test,
truth = g,
paste0(".pred_",0):paste0(".pred_",nlevels(test_data$g)-1),
estimator = estimator)
} else {
roc_auc(fit_test, truth = g, .pred_0)
}
results <- list(fit = fit_test,
cf_matrix = confusion_matrix,
cf_plot = confusion_matrix_plot,
acc = acc,
roc_curve = roc_curve,
auc_roc = auc_roc)
return(results)
}
# model_fit ---------------------------------------------------------------
#' Fits and compares two workflows applied to two datasets
#'
#' @param datasets A list of datasets to compare
#' @param workflows A list containing elements of type workflow
#' @param folds Integer number of folds for cross validation, default = 10
#' @return A list with decision-boundary plots, fitted models and prediction grids
#' @examples
model_fit_compare <- function(data, workflows, folds = 10) {
# Verify if inputs are correct data types
stopifnot("Datasets should be a list of dataframes" = is.list(data))
stopifnot("Workflows need to be of type workflow and passed as list" = is.list(workflows))
# Define variables
datasets <- lapply(data, function(f) f$dataset)
grids <- lapply(data, function(f) f$cond)
plots <- lapply(data, function(f) f$border_plot)
names(datasets) <- paste0("dataset",1:length(datasets))
names(grids) <- paste0("dataset",1:length(datasets))
names(plots) <- paste0("dataset",1:length(datasets))
# Split train and test
splits <- lapply(datasets, rsample::initial_split, prop = 0.8)
split_dataset <- list(
train = lapply(splits, rsample::training),
test = lapply(splits, rsample::testing)
)
# Fit control and resamples
fit_control <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
split_dataset$folds <- lapply(split_dataset$train, vfold_cv, v = folds)
for (i in 1:length(split_dataset$folds)) {
name = names(split_dataset$folds[i])
# for metrics
split_dataset$fit_resample_train[[name]] =
lapply(
workflows,
tune::fit_resamples,
resamples = split_dataset[["folds"]][[i]],
verbose = TRUE,
control = fit_control
)
# models
split_dataset$fit_model[[name]] =
lapply(
workflows,
parsnip::fit,
split_dataset[["train"]][[i]]
)
# extract model
split_dataset$models[[name]] =
lapply(
split_dataset$fit_model[[i]],
workflows::extract_fit_parsnip
)
}
## Compare results
# Define grids for plots contour
for (i in 1:length(grids)) {
for (m in 1:length(split_dataset$models)){
grids$fitted[[paste0("dataset",i)]][[names(split_dataset$models[[i]][m])]] <-
grids[[i]] %>%
bind_cols(
predict(split_dataset$models[[i]][[m]], new_data = grids[[i]]),
predict(split_dataset$models[[i]][[m]], new_data = grids[[i]], type = "prob")
) %>%
mutate(
.pred_1 = round(.pred_1, 3),
.pred_0 = round(.pred_0, 3),
decision = .pred_1 - .pred_0
)
}
}
# Generate new plots
for (p in 1:length(plots)) {
for (g in 1:length(grids$fitted)){
plots$model_decision[[paste0("dataset",p)]][[names(grids$fitted[[p]][g])]] <-
plots[[p]] +
geom_contour(data = grids$fitted[[p]][[g]], aes(x = x1,y = x2, z = decision), color = "Purple", breaks = 0, size = 1, alpha = 0.5) +
geom_point(data = grids$fitted[[p]][[g]], aes(x = x1, y = x2, color =.pred_class ), size = 0.1, alpha = 0.1)
}
}
results <- (list(plots = plots, models = split_dataset, grids = grids))
return(results)
}
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
## [3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
## [5] LC_TIME=Portuguese_Portugal.1252
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 readr_2.1.2 tidyverse_1.3.1
## [5] MASS_7.3-57 kknn_1.3.1 gridExtra_2.3 nnet_7.3-17
## [9] keras_2.9.0 kernlab_0.9-30 LiblineaR_2.10-12 C50_0.1.6
## [13] discrim_0.2.0 rpart.plot_3.1.1 rpart_4.1.16 yardstick_0.0.9
## [17] workflowsets_0.2.1 workflows_0.2.6 tune_0.2.0 tibble_3.1.7
## [21] rsample_0.1.1 recipes_0.2.0 parsnip_0.2.0 modeldata_0.1.1
## [25] infer_1.0.0 dials_0.1.1 scales_1.2.0 broom_0.8.0
## [29] tidymodels_0.1.4 magrittr_2.0.3 tidyr_1.2.0 mvtnorm_1.1-3
## [33] purrr_0.3.4 dplyr_1.0.9 ggplot2_3.3.6
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.0 backports_1.4.1 plyr_1.8.7 igraph_1.3.1
## [5] splines_4.1.3 listenv_0.8.0 tfruns_1.5.0 digest_0.6.29
## [9] foreach_1.5.2 htmltools_0.5.2 fansi_1.0.3 tzdb_0.3.0
## [13] globals_0.15.0 modelr_0.1.8 gower_1.0.0 hardhat_0.2.0
## [17] colorspace_2.0-3 rvest_1.0.2 haven_2.5.0 xfun_0.31
## [21] crayon_1.5.1 jsonlite_1.8.0 libcoin_1.0-9 zeallot_0.1.0
## [25] survival_3.3-1 iterators_1.0.14 glue_1.6.2 gtable_0.3.0
## [29] ipred_0.9-12 future.apply_1.9.0 DBI_1.1.2 Rcpp_1.0.8.3
## [33] isoband_0.2.5 Cubist_0.4.0 reticulate_1.25 GPfit_1.0-8
## [37] Formula_1.2-4 lava_1.6.10 prodlim_2019.11.13 httr_1.4.3
## [41] RColorBrewer_1.1-3 ellipsis_0.3.2 pkgconfig_2.0.3 farver_2.1.0
## [45] sass_0.4.1 dbplyr_2.1.1 utf8_1.2.2 tidyselect_1.1.2
## [49] labeling_0.4.2 rlang_1.0.2 DiceDesign_1.9 reshape2_1.4.4
## [53] munsell_0.5.0 cellranger_1.1.0 tools_4.1.3 cli_3.3.0
## [57] generics_0.1.2 evaluate_0.15 fastmap_1.1.0 yaml_2.3.5
## [61] knitr_1.39 fs_1.5.2 future_1.26.1 whisker_0.4
## [65] xml2_1.3.3 compiler_4.1.3 rstudioapi_0.13 png_0.1-7
## [69] reprex_2.0.1 lhs_1.1.5 bslib_0.3.1 stringi_1.7.6
## [73] highr_0.9 lattice_0.20-45 Matrix_1.4-1 tensorflow_2.9.0
## [77] vctrs_0.4.1 pillar_1.7.0 lifecycle_1.0.1 furrr_0.3.0
## [81] jquerylib_0.1.4 R6_2.5.1 bookdown_0.26 renv_0.15.5
## [85] parallelly_1.31.1 codetools_0.2-18 assertthat_0.2.1 withr_2.5.0
## [89] parallel_4.1.3 hms_1.1.1 grid_4.1.3 timeDate_3043.102
## [93] class_7.3-20 rmarkdown_2.14 inum_1.0-4 partykit_1.2-15
## [97] pROC_1.18.0 lubridate_1.8.0 base64enc_0.1-3
Logistic regression and nearest-neighbor methods are opposing extremes of the bias-variance scale. A logistic regression estimates \(Pr(G = 1 | X)\) using the logistic function \(p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1+e^{\beta_{0} + \beta_{1}X}}\). This translates to a linear classification boundary when plotted against the independent variables.
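The logistic function above can be evaluated directly in base R; the coefficients below are purely illustrative, not fitted values from this project.

```r
# Logistic function: p(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)).
# beta0 and beta1 are illustrative coefficients.
beta0 <- -2
beta1 <- 0.8
p_manual <- function(x) exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
p_manual(2.5)               # at x = 2.5 the linear predictor is 0, so p = 0.5
plogis(beta0 + beta1 * 2.5) # same value via base R's plogis
# Setting p = 0.5 gives beta0 + beta1*x = 0: the decision boundary is linear in x.
```

Solving \(p(X) = 0.5\) yields the linear equation \(\beta_{0} + \beta_{1}X = 0\), which is why the classification boundary of a logistic regression is a straight line in feature space.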
On the other hand, nearest neighbor makes no such assumptions and relies solely on local (neighbor) information to estimate \(Pr(G = 1 | X)\). This makes for a very flexible model, but one highly sensitive to new information.
Given the opposing characteristics of each approach, our initial hypothesis is that a dataset with a perfectly linear decision boundary will be easily estimated by the biased logistic regression, while the NN approach, given its nature, will struggle, since points close to the border have a strong influence. Therefore, we defined the comparison datasets as follows:
James et al. (2013): in a p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace, in other words, a line. In two dimensions the hyperplane is defined as
\[\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} = 0\]
decision_fun_linear <- function(x1, x2){
res <- ifelse(x2 >= x1, 1, 0)
return(res)
}
decision_fun_quadratic <- function(x1, x2){
res <- ifelse(x2 >= abs( (1.2 * x1 - 5)^2 + 2 ), 1, 0)
return(res)
}
dataset_linear <- dataset_gen_unif(class_fun = decision_fun_linear, size = 1000)
dataset_quadratic <- dataset_gen_unif(class_fun = decision_fun_quadratic, size = 1000)
grid.arrange(
dataset_linear$border_plot + labs(subtitle ="Linear border", color = "" ),
dataset_quadratic$border_plot + labs(subtitle = "Quadratic border", color = ""),
nrow = 1,
top = "Synthetic Generated Datasets",
bottom = grid::textGrob(
"Dashed line represent optimal bayes decision boundary",
gp = grid::gpar(fontface = 3, fontsize = 9)
)
)
The following workflow (using Kuhn and Wickham (2021)) was executed in order to fit and evaluate each model given the above defined datasets:
Train - Test split by 80% Train - 20% Test,
Define 10 random folds splitting Train and Validation,
Fit models to each fold,
Calculate Fold metrics,
Fit to all train data and extract model,
Create plots with estimated decision boundaries
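The split and fold steps above can be sketched with rsample, assuming a generic data frame `d` with a factor target `g` (any of the generated datasets would do):

```r
# Minimal sketch of the train/test split and cross-validation folds.
library(rsample)
set.seed(123)
d <- data.frame(x1 = runif(100), x2 = runif(100))
d$g <- factor(ifelse(d$x2 >= d$x1, 1, 0))
split <- initial_split(d, prop = 0.8)  # 80% train / 20% test
train <- training(split)
test  <- testing(split)
folds <- vfold_cv(train, v = 10)       # 10 validation folds on the train set
nrow(train)  # 80
nrow(folds)  # 10 resample rows, one per fold
```

This mirrors what `model_fit_compare` does internally before fitting each workflow to every fold.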
# define workflows
### Logistic regression
# 1. specify the model
logistic_reg_glm_spec <-
parsnip::logistic_reg(mode = "classification") %>%
parsnip::set_engine('glm', family = "binomial")
# 2. preprocessing
preprocess <-
recipes::recipe(g ~ x1 + x2 , data = dataset_linear$dataset)
# 3. Build workflow
logit_wflow <-
workflows::workflow() %>%
workflows::add_model(logistic_reg_glm_spec) %>%
workflows::add_recipe(preprocess)
### Nearest neighbor (KNN)
# 1. specify the model
nearest_neighbor_kknn_spec <-
parsnip::nearest_neighbor() %>%
parsnip::set_engine('kknn') %>% # Default k = 5
parsnip::set_mode('classification')
# 2. preprocessing
preprocess <-
recipes::recipe(g ~ x1 + x2 , data = dataset_linear$dataset) # just to set the relation, irrelevant which dataset used
# 3. Build workflow
knn_wflow <-
workflows::workflow() %>%
workflows::add_model(nearest_neighbor_kknn_spec) %>%
workflows::add_recipe(preprocess)
data <- list(dataset_linear, dataset_quadratic)
workflows <- list(logistic = logit_wflow, knn = knn_wflow)
compare_fit <- model_fit_compare(data = data, workflows = workflows)
From the above plots we can conclude that the results are much in line with our initial hypothesis. Given a linear boundary, the logistic regression was able to match it almost perfectly, while the KNN model was influenced by observations closer to the border. On the other hand, the logistic regression was unable to classify correctly when the border was a quadratic function.
The following plot showcases the performance of each model calculated on each of the 10 validation folds. The struggle of the logistic model on a quadratic boundary is clear.
Below are the results on the test set, which confirm our preliminary conclusions.
Given its simplicity and the explicit difference between the classes, both models perform very well on a dataset with a clear linear boundary. That is visible on both the training and test datasets, although with an edge towards the logistic regression.
The nearest-neighbor method (K-NN): the K-NN method is a non-parametric statistical pattern recognition procedure and, among the various non-parametric techniques, is the most intuitive, yet it possesses powerful statistical properties (Toth et al., 2000).
ANN methodology: there are many ways of using ANNs in the context of RR models (Anctil et al., 2004), such as network topology, training algorithm, input selection and network size optimization, each carrying a priori experience-based assumptions. Neural networks distribute computations to processing units called neurons, grouped in layers. Three layer types can be distinguished: an input layer, connecting the input information to the network; one or more hidden layers, acting as intermediate computational layers between input and output; and an output layer, producing the final output (Toth et al., 2000).
The MLP dataset is based on a normal distribution but without clearly separated boundaries, so we expect the MLP model to work well on this data.
On the other hand, the KNN dataset has a perfectly circular boundary, so we expect KNN to do better on this dataset. Even so, the difference in this case is smaller, because the MLP also deals well with this boundary, which it can approximate with linear segments.
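A circular decision rule like the one described above can be written in the same `class_fun` style used by `dataset_gen_unif`; the center and radius here are assumptions for illustration only, not the exact values used in the project.

```r
# Hedged sketch of a circular classification rule over the [0, 10] feature space:
# points inside the circle of radius 2 centered at (5, 5) get class 1.
decision_fun_circle <- function(x1, x2) {
  res <- ifelse((x1 - 5)^2 + (x2 - 5)^2 <= 2^2, 1, 0)
  return(res)
}
decision_fun_circle(5, 5)  # center of the circle -> 1
decision_fun_circle(0, 0)  # far outside -> 0
```

Such a function can be passed directly as `class_fun` to `dataset_gen_unif` to generate the circular-boundary dataset.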
## Warning: The `...` are not used in this function but one or more objects were passed: 'verbose'
The plots below show the resulting decision boundaries.
The plots show the evolution of key metrics over cross-validation training.
The plots show the metrics across the different models.
As seen in the above plots, the results are in line with our expectations: the MLP performs better on dataset 1 (the MLP dataset), while KNN is slightly better on dataset 2 (the KNN dataset). As mentioned, the differences are not huge, so with parameter tuning a good performance could be achieved with the MLP on this dataset as well.
A neural network will almost always have the same activation function in all hidden layers.
It is most unusual to vary the activation function through a network model.
Traditionally, the sigmoid was the default activation function in the 1990s. From perhaps the mid-to-late 1990s through the 2010s, the tanh function was the default activation function for hidden layers.
The hyperbolic tangent activation function typically performs better than the logistic sigmoid.(Page 195, Deep Learning, 2016)
Both the sigmoid and Tanh functions can make the model more susceptible to problems during training, via the so-called vanishing gradients problem.
The activation function used in hidden layers is typically chosen based on the type of neural network architecture.
Modern neural network models with common architectures, such as MLP and CNN, will make use of the ReLU activation function, or extensions.
In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Page 174, Deep Learning, 2016.)
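The activation functions discussed above can be written as plain R functions (tanh is already in base R):

```r
# Sigmoid and ReLU, plus base R's tanh, evaluated at a few points.
sigmoid <- function(x) 1 / (1 + exp(-x))   # logistic sigmoid, output in (0, 1)
relu    <- function(x) pmax(0, x)          # rectified linear unit
x <- c(-4, 0, 4)
sigmoid(x)  # saturates near 0 and 1 for large |x|
tanh(x)     # saturates near -1 and 1 for large |x|
relu(x)     # 0 0 4: gradient is 0 for x < 0 and constant 1 for x > 0
```

The saturation of sigmoid and tanh for large |x| is what drives their gradients toward zero (the vanishing-gradients problem), whereas ReLU keeps a constant gradient of 1 on its positive side.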
## Warning: The `...` are not used in this function but one or more objects were passed: 'verbose'
The plots below show the resulting decision boundaries.
## Warning: stat_contour(): Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
The plots show the evolution of key metrics over cross-validation training.
The plots show the metrics across the different models.
Similarly to logistic regression, Linear Discriminant Analysis (LDA) estimates each class by modeling the conditional distribution \(Pr(G = 1 | X)\), but using a different approach. While logistic regression uses the logistic function for this purpose, LDA assumes each observation is drawn from a multivariate normal distribution with a common covariance matrix and uses this distribution to model the conditional probabilities. Therefore, on many occasions, the outputs of logistic regression and LDA will be quite similar; Hastie, Tibshirani, and Friedman (2009) note that, in their experience, these models give very similar results.
Tree-based methods take a very different approach to the problem. They involve stratifying or segmenting the predictor space into a number of simple regions (see James et al. (2013), p. 303). A metric like the mean or median is then used as the predictor for each region or cut.
These opposing techniques have pronounced characteristics, making each appropriate for datasets with specific properties. Whenever the classes are based on very pronounced cuts, tree-based approaches and their rule-based classification will tend to fit best. On the other hand, in their simple form as decision trees, they tend to perform worse than traditional parametric counterparts such as LDA when the classification border follows a clear distribution.
Therefore, we defined the comparing datasets as follows:
James et al. (2013): in a p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace, in other words, a line. In two dimensions the hyperplane is defined as
\[\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} = 0\]
decision_fun_linear <- function(x1, x2){
res <- ifelse(x2 >= x1, 1, 0)
return(res)
}
dataset_linear <- dataset_gen_unif(class_fun = decision_fun_linear, size = 1000)
decision_fun_normal <- function(x1, x2){
mu <- c(6,6)
cvar <- matrix(c(0.5,0,0,0.5), 2, 2)
p <- mvtnorm::pmvnorm(c(x1,x2), mean = mu, sigma = cvar)
res <- ifelse(p[1] <= 0.5, 1, 0)
return(res)
}
dataset_dtree <- dataset_gen_unif(class_fun = decision_fun_normal)
grid.arrange(
dataset_dtree$border_plot + labs(subtitle = "Segmented dataset", color = ""),
dataset_linear$border_plot + labs(subtitle = "Linear border", color = ""),
nrow = 1,
top = "Synthetic Generated Datasets",
bottom = grid::textGrob(
"Dashed line represent optimal bayes decision boundary",
gp = grid::gpar(fontface = 3, fontsize = 9)
)
)
The following workflow (using Kuhn and Wickham (2021)) was executed in order to fit and evaluate each model given the above defined datasets:
Train - Test split by 80% Train - 20% Test,
Define 10 random folds splitting Train and Validation,
Fit models to each fold,
Calculate Fold metrics,
Fit to all train data and extract model,
Create plots with estimated decision boundaries
The decision tree is fitted using the following default parameters:
# define workflows
### LDA
# 1. specify the model
discrim_linear_MASS_spec <-
discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
# 2. preprocessing
preprocess <-
recipes::recipe(g ~ x1 + x2 , data = dataset_linear$dataset)
# 3. Build workflow
lda_wflow <-
workflows::workflow() %>%
workflows::add_model(discrim_linear_MASS_spec) %>%
workflows::add_recipe(preprocess)
### Decision tree
# 1. specify the model
decision_tree_rpart_spec <-
decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
# 2. preprocessing
preprocess <-
recipes::recipe(g ~ x1 + x2 , data = dataset_linear$dataset) # just to set the relation, irrelevant which dataset used
# 3. Build workflow
dtree_wflow <-
workflows::workflow() %>%
workflows::add_model(decision_tree_rpart_spec) %>%
workflows::add_recipe(preprocess)
data <- list(dataset_linear, dataset_dtree)
workflows <- list(lda = lda_wflow , dtree = dtree_wflow)
compare_fit <- model_fit_compare(data = data, workflows = workflows)
As expected, when facing a perfectly linear boundary the LDA outperformed the decision tree. Given its segmented approach, a decision tree will try to fit an increasing number of segments onto the line. In theory, if this process were allowed to continue without any limitation, the tree could define a high enough number of segments to closely resemble a line, but that is hardly a computationally efficient approach (as said above, the tree depth is 30).
On the other hand, LDA failed to capture the clearly segmented area of the dataset, since it tried to fit a linear border to the data.
Below are the results on the test set, which confirm our preliminary conclusions.
Our analysis shows that when a clear segmentation of classes exists, a decision tree can outperform LDA under certain conditions.
# Linear Discriminant Analysis vs Quadratic Discriminant Analysis
Linear Discriminant Analysis (LDA) assumes normally distributed data, a class-specific mean vector, and a common covariance matrix, that is, a covariance matrix shared by all classes in the dataset. When these assumptions hold, LDA approximates the Bayes classifier very closely and the discriminant function produces a linear decision boundary.
Quadratic Discriminant Analysis
QDA assumes a normal distribution (as LDA does) but allows each class its own covariance matrix (unlike LDA). When these assumptions hold, QDA approximates the Bayes classifier very closely and the discriminant function produces a quadratic decision boundary.
So, we construct two datasets: 1. LDA dataset: 1000 samples with two variables whose classes have very similar covariance; 2. QDA dataset: classes with clearly different covariance matrices, violating the LDA assumption.
Therefore, we expect the LDA model to show better results on the LDA dataset, even though QDA does not show a huge difference there, while the QDA model will work better on the QDA dataset, because the assumptions of the LDA model do not hold in that dataset.
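The contrast between the two assumption sets can be sketched with MASS, which provides both classifiers; the means and covariances below are assumptions for illustration, not the exact values used to build the project datasets.

```r
# Two multivariate normal classes: class 1 has an inflated covariance, so the
# common-covariance assumption of LDA is violated while QDA's is not.
library(MASS)
set.seed(1)
n <- 200
cls0 <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))      # class 0
cls1 <- mvrnorm(n, mu = c(3, 3), Sigma = 2 * diag(2))  # class 1, larger spread
d <- setNames(data.frame(rbind(cls0, cls1)), c("x1", "x2"))
d$g <- factor(rep(0:1, each = n))
fit_lda <- lda(g ~ x1 + x2, data = d)  # pools a single covariance matrix
fit_qda <- qda(g ~ x1 + x2, data = d)  # estimates one covariance per class
```

`lda` yields a linear boundary from the pooled covariance, while `qda` yields a quadratic one from the per-class estimates.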
# define workflows
### Fitting
## Warning: The `...` are not used in this function but one or more objects were passed: 'verbose'
The plots below show the resulting decision boundaries.
The plots show the evolution of key metrics over cross-validation training.
The plots show the metrics across the different models.
## Conclusion
Given the strong assumptions LDA makes about the covariance of the classes, these results are easy to understand. QDA's relaxation of that assumption allows good performance on the QDA dataset, while still achieving a good result on the LDA dataset.
On most occasions we can expect LDA and logistic regression to have similar performance. Nonetheless, the underlying assumption that each feature follows a normal distribution makes LDA more sensitive to outliers, especially in a context with a smaller number of observations.
The LDA classification rule depends on the mean of all of the observations within each class, as well as the within-class covariance matrix computed using all of the observations. In contrast, logistic regression has very low sensitivity to observations far from the decision boundary (James et al. (2013)).
To demonstrate our hypothesis we defined two datasets as follows:
A dataset of 1000 pairs of (x1, x2) observations drawn from two multivariate normal distributions, each representing a class. The dataset is perfectly balanced;
A dataset of 50 pairs of (x1, x2) observations drawn from two multivariate normal distributions, each representing a class, with a 70/30 class balance. The most extreme value was multiplied by a factor in order to increase its impact on the mean.
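A hedged sketch of how such datasets could be generated with mvtnorm follows; the exact means, covariances and outlier factor used in the project are assumptions here.

```r
# Two multivariate normal classes; make_ds controls the size and balance.
library(mvtnorm)
set.seed(42)
make_ds <- function(n0, n1) {
  a <- rmvnorm(n0, mean = c(2, 2), sigma = diag(2))  # class 0
  b <- rmvnorm(n1, mean = c(6, 6), sigma = diag(2))  # class 1
  d <- setNames(data.frame(rbind(a, b)), c("x1", "x2"))
  d$g <- factor(rep(0:1, c(n0, n1)))
  d
}
ds_balanced <- make_ds(500, 500)  # 1000 obs, 50/50 balance
ds_small    <- make_ds(35, 15)    # 50 obs, 70/30 balance
i <- which.max(ds_small$x1)
ds_small$x1[i] <- ds_small$x1[i] * 3  # amplify the most extreme value
```

The amplified point pulls the class mean and covariance estimates that LDA relies on, which is exactly the sensitivity the comparison is designed to expose.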
The following workflow (using Kuhn and Wickham (2021)) was executed in order to fit and evaluate each model given the above defined datasets:
Train - Test split by 80% Train - 20% Test,
Define 10 random folds splitting Train and Validation,
Fit models to each fold,
Calculate Fold metrics,
Fit to all train data and extract model,
Create plots with estimated decision boundaries
# define workflows
### LDA
# 1. specify the model
discrim_linear_MASS_spec <-
discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
# 2. preprocessing
preprocess <-
recipe(g ~ x1 + x2 , data = dataset_mv$dataset)
# 3. Build workflow
lda_wflow <-
workflow() %>%
add_model(discrim_linear_MASS_spec) %>%
add_recipe(preprocess)
### Logistic
# 1. specify the model
logistic_reg_glm_spec <-
logistic_reg(mode = "classification") %>%
set_engine('glm', family = "binomial")
# 2. preprocessing
preprocess <-
recipe(g ~ x1 + x2 , data = dataset_mv$dataset)
# 3, Buildworkflow
log_wflow <-
workflow() %>%
add_model(logistic_reg_glm_spec) %>%
add_recipe(preprocess)
data <- list(dataset_mv, dataset_mv_s_outlier)
workflows <- list(log = log_wflow, lda = lda_wflow)
compare_fit <- model_fit_compare(data = data, workflows = workflows)
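`model_fit_compare()` is a helper defined elsewhere in this project; internally it presumably amounts to resampling each workflow and collecting the fold metrics, roughly as sketched below (`lda_wflow` and a `folds` resampling object are assumed to exist).

```r
library(tune)      # fit_resamples, collect_metrics
library(yardstick) # accuracy, roc_auc

# Fit the workflow on every fold and collect accuracy / ROC AUC
res <- fit_resamples(
  lda_wflow,
  resamples = folds,
  metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(res)
```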
Despite still being able to fit a model, as expected, the pull from the outliers has a stronger effect on LDA than on logistic regression.
Despite their similarities, we demonstrated that under certain conditions logistic regression can outperform LDA. Because of LDA's higher sensitivity to outliers, on a small dataset containing outliers these points have a bigger impact on its classification boundary.
| ds2_log | ds2_lda | ds1_log | ds1_lda |
|---|---|---|---|
| 0.875 | 0.875 | 0.9993995 | 0.9993995 |
Both models are non-parametric, meaning the data distribution cannot be summarized by a small number of parameters: neither decision trees nor KNN makes an assumption about the distribution of the data. While decision trees support automatic feature interaction, KNN does not. Moreover, decision trees can be faster at prediction, whereas KNN tends to be slow on large datasets because it scans the whole dataset at prediction time rather than generalizing from the data in advance.
Decision trees:

Advantages:

- Decision trees are effective at capturing non-linear relationships, which can be difficult to achieve with other algorithms such as support vector machines and linear regression.
- Easy to explain: the outputs are easy to read without requiring statistical knowledge or complex concepts.
- Some people believe decision trees mirror human decision-making more closely than regression and classification approaches do.
- Trees can be displayed graphically and easily interpreted by non-experts.
- Decision trees can easily handle qualitative (categorical) features without the need to create dummy variables.

Disadvantages:

- They generally do not reach the same level of predictive accuracy as some other regression and classification approaches.
- Trees can be non-robust: a small change in the data can cause a large change in the final estimated tree.
- As the tree grows in size, it becomes prone to overfitting and requires pruning.
KNN:

Advantages:

- Simple and intuitive: like decision trees, it is easy to explain to laypeople.
- Non-parametric: it makes no assumptions about the data distribution.
- No training step: KNN is an exception to the general machine learning workflow. The "model" is simply the labeled dataset placed in a metric space; to classify a new object, it reads through all the data and compares the distances to the closest objects.
- Easy to use for multi-class problems: compared to other algorithms, it handles multiclass prediction naturally.
- Few hyperparameters: you only need to provide k (the number of neighbors to consider) and the choice of distance function (e.g. Euclidean or Manhattan distance).
- It can be used for both classification and regression.
- Instance-based (lazy) learning: you don't need to fit a model in advance; just provide the data point and it will give you the prediction.

Disadvantages:

- Slow with larger datasets: to classify a new sample it has to read the whole dataset, so prediction slows down as the dataset grows.
- Curse of dimensionality: KNN is more appropriate when the number of inputs is small; as the number of variables grows, it has a hard time predicting the output of a new data point.
- Features must be scaled: KNN uses distance criteria such as the Euclidean or Manhattan distance, so it is very important that all features share the same scale.
- Outlier sensitivity: since it is an instance-based algorithm relying on distances, outliers in the data can bias the outcome.
- Missing values are not handled: it cannot deal with missing values on its own.
- Class imbalance can be an issue: with imbalanced classes the algorithm might wrongly favor the majority class.
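Because KNN is distance-based, the scaling point above matters in practice. A minimal tidymodels sketch of a KNN workflow with normalized features (the formula and column names follow the other workflows in this project; the kknn engine and the normalization step are illustrative assumptions, not necessarily what was used here):

```r
library(parsnip)
library(recipes)
library(workflows)

# KNN specification with default hyperparameters, kknn engine
knn_spec <-
  nearest_neighbor() %>%
  set_mode("classification") %>%
  set_engine("kknn")

# Normalize the features so x1 and x2 contribute equally to the distance
knn_rec <-
  recipe(g ~ x1 + x2, data = train) %>%
  step_normalize(all_numeric_predictors())

knn_wflow <-
  workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)
```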
For the KNN dataset we chose a similar distribution and dispersion between the classes, to avoid confusing the model in the boundary areas. For the decision-tree dataset we generated data without clear boundaries and with an unbalanced number of samples between the classes, because we know this can affect the performance of KNN.
The plots below show the resulting decision boundaries.
The plots below show the evolution of key metrics over the cross-validation training.
The plots below compare the metrics across the different models.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc macro_weighted 0.949
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc macro_weighted 0.919
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc macro_weighted 0.717
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc macro_weighted 0.813
A decision tree partitions the data to reach an accurate result, whereas KNN looks up similar values in the data. Every algorithm has its own merits and demerits, and no single algorithm satisfies all criteria and requirements; each has its own characteristics, so the algorithm should be chosen according to the requirements of the problem.
Boosting works by growing trees sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original dataset. (James et al. (2013))
Compared to a simple decision tree, we expect a boosting model to be able to tackle more complex problems. We previously demonstrated that decision trees struggle in the case of a perfectly linear boundary; how will a boosted tree perform? We therefore generated the following datasets:
James et al. (2013): In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace: in other words, a line. In two dimensions the hyperplane is defined as
\[\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} = 0\]
The following workflow (using Kuhn and Wickham (2021)) was executed in order to fit and evaluate each model on the datasets defined above:

1. Train - test split (80% train, 20% test),
2. Define 10 random folds splitting train and validation,
3. Fit the models to each fold,
4. Calculate the fold metrics,
5. Fit to all training data and extract the model,
6. Create plots with the estimated decision boundaries.
The decision tree is fitted using the default parameters.
Boosted tree:
Attention: the adaBoost package is not implemented out of the box in Parsnip, the fitting package used in this project. Instead of going for a manual implementation, we opted to use the C5.0 implementation of boosted trees, which closely resembles the AdaBoost approach.
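The two specifications can be sketched in parsnip as follows (default hyperparameters, as stated in the introduction; the rpart engine for the plain decision tree is an assumption):

```r
library(parsnip)

# Plain decision tree with default hyperparameters
dtree_spec <-
  decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart")

# Boosted trees via the C5.0 engine, used here in place of AdaBoost
boost_spec <-
  boost_tree() %>%
  set_mode("classification") %>%
  set_engine("C5.0")
```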
The plots below compare the metrics across the different models.
The boosted approach clearly outperformed a single decision tree in the linear-boundary scenario (although it still lagged behind LDA or logistic regression).
| ds2_boosting | ds2_dtree | ds1_boosting | ds1_dtree |
|---|---|---|---|
| 0.9042281 | 0.876023 | 0.9952192 | 0.9502085 |
We expect a support vector machine with a linear kernel to perform very well when the classification boundary follows a perfect line. On the other hand, non-linear boundaries should lead to poor performance. For this example we set up a rather extreme case in which the boundary is a perfectly centered circle.
James et al. (2013): In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace: in other words, a line. In two dimensions the hyperplane is defined as
\[\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} = 0\]
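The circular-boundary dataset can be sketched in base R; the sample size, square extent, and radius below are illustrative assumptions, not the exact values used in this project.

```r
set.seed(123)
n <- 1000

# Uniform points on a square, labelled by a perfectly centered circle
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
g  <- factor(ifelse(x1^2 + x2^2 < 0.5^2, "inside", "outside"))

dataset_circle <- data.frame(x1, x2, g)
```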
The following workflow (using Kuhn and Wickham (2021)) was executed in order to fit and evaluate each model on the datasets defined above:

1. Train - test split (80% train, 20% test),
2. Define 10 random folds splitting train and validation,
3. Fit the models to each fold,
4. Calculate the fold metrics,
5. Fit to all training data and extract the model,
6. Create plots with the estimated decision boundaries.
## Setting default kernel parameters
## Setting default kernel parameters
## maximum number of iterations reached 0.0006068825 -0.0006062308
The plots below show the resulting decision boundaries.
The plots below compare the metrics across the different models.
As expected, the linear kernel performed very poorly in the case where a non-linear boundary existed; so poorly, in fact, that in our experiment the optimizer could not even converge on a model that minimized the loss function.
| ds2_svm_linear | ds2_svm_radial | ds1_svm_linear | ds1_svm_radial |
|---|---|---|---|
| 0.6272727 | 0.9975758 | 1 | 1 |
Both approaches can fit non-linear decision boundaries, but how do they cope when outliers exist and the number of observations is low? Similar to what we did when comparing LDA against logistic regression, we defined two unbalanced datasets and, in one of them, increased the impact of the outliers.
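The two specifications being compared can be sketched in parsnip as follows (default hyperparameters, as elsewhere in this project; the kernlab engine is assumed):

```r
library(parsnip)

# Polynomial-kernel SVM with default hyperparameters
svm_poly_spec <-
  svm_poly() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Radial-basis-kernel SVM with default hyperparameters
svm_radial_spec <-
  svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```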
The following workflow (using Kuhn and Wickham (2021)) was executed in order to fit and evaluate each model on the datasets defined above:

1. Train - test split (80% train, 20% test),
2. Define 10 random folds splitting train and validation,
3. Fit the models to each fold,
4. Calculate the fold metrics,
5. Fit to all training data and extract the model,
6. Create plots with the estimated decision boundaries.
## ! Fold04: internal: No event observations were detected in `truth` with event level '0'.
## ! Fold07: internal: No event observations were detected in `truth` with event level '0'.
## ! Fold08: internal: No event observations were detected in `truth` with event level '0'.
## ! Fold02: internal: No event observations were detected in `truth` with event level '0'.
The plots below show the resulting decision boundaries.
The plots below show the evolution of key metrics over the cross-validation training.
The plots below compare the metrics across the different models.
As expected, the presence of outliers decreased the performance of the radial SVM.
| ds2_svm_poly | ds2_svm_radial | ds1_svm_poly | ds1_svm_radial |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
In general, ANN methods, as semi-parametric methods, have many advantages: they allow a large number of variables in the model, require no assumptions such as normality and …, can find models despite missing data, and can detect complex and non-linear relationships between independent and dependent variables.
Although theoretical and practical studies have hinted that ANNs perform better than statistical methods, they have some disadvantages: the accuracy of the results depends largely on the size of the training set; they require the initialization and adjustment of many individual parameters to optimize classification performance; standardized coefficients and odds ratios corresponding to each independent variable cannot be easily calculated; and although weights are generated in a neural network analysis, their interpretation is difficult and they may be influenced by the program used to generate them. (Teshnizi SH, Ayatollahi SM. A Comparison of Logistic Regression Model and Artificial Neural Networks in Predicting of Student's Academic Failure. Acta Inform Med. 2015 Oct;23(5):296-300. doi: 10.5455/aim.2015.23.296-300.)
We therefore expect the MLP to work better on a dataset with a non-linear boundary, while logistic regression should fit well on a dataset with a linear boundary.
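The two competing specifications can be sketched in parsnip as follows (default hyperparameters, as elsewhere in this project; the nnet engine for the MLP is an assumption):

```r
library(parsnip)

# Single-hidden-layer perceptron with default hyperparameters
mlp_spec <-
  mlp() %>%
  set_mode("classification") %>%
  set_engine("nnet")

# Baseline: logistic regression via glm
log_spec <-
  logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")
```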
The plots below show the resulting decision boundaries
As expected, the MLP model performed better on the non-linear boundary dataset, while logistic regression performed better on the linear boundary dataset, although in the latter case the difference was not as large.
The plots below show the evolution of key metrics over the cross-validation training.
The plots below compare the metrics across the different models.
In comparison with the conventional logistic regression model, the ANN model was more accurate on general prediction tasks. Based on these results, artificial neural network methods seem appropriate for the classification of a dichotomous dependent variable.